173 research outputs found
PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming
Data cleaning is naturally framed as probabilistic inference in a generative
model, combining a prior distribution over ground-truth databases with a
likelihood that models the noisy channel by which the data are filtered,
corrupted, and joined to yield incomplete, dirty, and denormalized datasets.
Based on this view, we present PClean, a unified generative modeling
architecture for cleaning and normalizing dirty data in diverse domains. Given
an unclean dataset and a probabilistic program encoding relevant domain
knowledge, PClean learns a structured representation of the data as a
relational database of interrelated objects, and uses this latent structure to
impute missing values, identify duplicates, detect errors, and propose
corrections in the original data table. PClean makes three modeling and
inference contributions: (i) a domain-general non-parametric generative model
of relational data, for inferring latent objects and their network of latent
connections; (ii) a domain-specific probabilistic programming language, for
encoding domain knowledge specific to each dataset being cleaned; and (iii) a
domain-general inference engine that adapts to each PClean program by
constructing data-driven proposals used in sequential Monte Carlo and particle
Gibbs. We show empirically that short (< 50-line) PClean programs deliver
higher accuracy than state-of-the-art data cleaning systems based on machine
learning and weighted logic; that PClean's inference algorithm is faster than
generic particle Gibbs inference for probabilistic programs; and that PClean
scales to large real-world datasets with millions of rows.
Comment: Added references; revised abstract
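The core idea of the abstract, cleaning as posterior inference under a prior over ground-truth values and a noisy-channel likelihood, can be sketched in a few lines. This is an illustrative stand-in, not PClean's actual DSL or inference engine; the prior, the similarity-based likelihood, and the city names are all assumptions made for the example.

```python
# Sketch: data cleaning as Bayesian inference. A prior over clean values
# plus a noisy-channel likelihood yields a posterior over corrections.
# (Illustrative only; not PClean's probabilistic programming language.)
from difflib import SequenceMatcher

# Hypothetical prior over ground-truth city names (assumed frequencies).
PRIOR = {"New York": 0.5, "Newark": 0.3, "Yonkers": 0.2}

def likelihood(observed, truth, noise=0.1):
    """P(observed | truth): higher when the strings are more similar."""
    sim = SequenceMatcher(None, observed, truth).ratio()
    return (1 - noise) * sim + noise  # smoothed so every truth has mass

def posterior(observed):
    """Posterior over clean values by enumeration over the prior support."""
    scores = {t: p * likelihood(observed, t) for t, p in PRIOR.items()}
    z = sum(scores.values())
    return {t: s / z for t, s in scores.items()}

def propose_correction(observed):
    """Propose the maximum-a-posteriori clean value for a dirty cell."""
    post = posterior(observed)
    return max(post, key=post.get)

print(propose_correction("New Yrok"))  # → New York
```

PClean replaces the enumerated prior with a non-parametric generative model over relational databases and the brute-force posterior with sequential Monte Carlo, but the cleaning-as-inference framing is the same.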
ADEV: Sound Automatic Differentiation of Expected Values of Probabilistic Programs
Optimizing the expected values of probabilistic processes is a central
problem in computer science and its applications, arising in fields ranging
from artificial intelligence to operations research to statistical computing.
Unfortunately, automatic differentiation techniques developed for deterministic
programs do not in general compute the correct gradients needed for widely used
solutions based on gradient-based optimization.
In this paper, we present ADEV, an extension to forward-mode AD that
correctly differentiates the expectations of probabilistic processes
represented as programs that make random choices. Our algorithm is a
source-to-source program transformation on an expressive, higher-order language
for probabilistic computation, with both discrete and continuous probability
distributions. The result of our transformation is a new probabilistic program,
whose expected return value is the derivative of the original program's
expectation. This output program can be run to generate unbiased Monte Carlo
estimates of the desired gradient, which can then be used within the inner loop
of stochastic gradient descent. We prove ADEV correct using logical relations
over the denotations of the source and target probabilistic programs. Because
it modularly extends forward-mode AD, our algorithm lends itself to a concise
implementation strategy, which we exploit to develop a prototype in just a few
dozen lines of Haskell (https://github.com/probcomp/adev).
Comment: to appear at POPL 2023
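The failure mode that motivates ADEV can be shown concretely: naively differentiating through a discrete random choice gives zero gradient, whereas an estimator built for expectations is unbiased. The sketch below uses a score-function estimator, one ingredient of the kind ADEV's transformation can produce, on a Bernoulli program. This is illustrative Python under assumed parameter values; the ADEV prototype itself is written in Haskell.

```python
# Sketch: why expectations need their own AD. For x ~ Bernoulli(theta) and
# E[f(x)] = 10 * theta, a pathwise derivative through the sample is 0, but
# the score-function estimator f(x) * d/dtheta log p(x; theta) is unbiased.
import random

def f(x):
    """The program's return value given its random choice."""
    return 10.0 if x == 1 else 0.0

def expectation(theta):
    """Exact E[f(x)] for x ~ Bernoulli(theta): equals 10 * theta."""
    return theta * f(1) + (1 - theta) * f(0)

def score_grad_sample(theta):
    """One unbiased sample of d/dtheta E[f(x)] via the score function."""
    x = 1 if random.random() < theta else 0
    dlogp = (1.0 / theta) if x == 1 else (-1.0 / (1 - theta))
    return f(x) * dlogp

random.seed(0)
theta = 0.3
est = sum(score_grad_sample(theta) for _ in range(200_000)) / 200_000
print(est)  # ≈ 10.0, the exact derivative of E[f] = 10 * theta
```

Running the transformed program many times and averaging, as above, is exactly the "unbiased Monte Carlo estimates of the desired gradient" the abstract describes feeding into stochastic gradient descent.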
From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
How does language inform our downstream thinking? In particular, how do
humans make meaning from language -- and how can we leverage a theory of
linguistic meaning to build machines that think in more human-like ways? In
this paper, we propose \textit{rational meaning construction}, a computational
framework for language-informed thinking that combines neural models of
language with probabilistic models for rational inference. We frame linguistic
meaning as a context-sensitive mapping from natural language into a
\textit{probabilistic language of thought} (PLoT) -- a general-purpose symbolic
substrate for probabilistic, generative world modeling. Our architecture
integrates two powerful computational tools that have not previously come
together: we model thinking with \textit{probabilistic programs}, an expressive
representation for flexible commonsense reasoning; and we model meaning
construction with \textit{large language models} (LLMs), which support
broad-coverage translation from natural language utterances to code expressions
in a probabilistic programming language. We illustrate our framework in action
through examples covering four core domains from cognitive science:
probabilistic reasoning, logical and relational reasoning, visual and physical
reasoning, and social reasoning about agents and their plans. In each, we show
that LLMs can generate context-sensitive translations that capture
pragmatically-appropriate linguistic meanings, while Bayesian inference with
the generated programs supports coherent and robust commonsense reasoning. We
extend our framework to integrate cognitively-motivated symbolic modules to
provide a unified commonsense thinking interface from language. Finally, we
explore how language can drive the construction of world models themselves.
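The two-stage architecture, an LLM translating an utterance into a probabilistic program, then Bayesian inference answering queries against it, can be sketched with a hand-written stand-in for the translation step. The tiny rain/sprinkler model and its probabilities are assumptions for illustration; they are not the paper's PLoT language or its examples.

```python
# Sketch: rational meaning construction. An utterance is mapped to a small
# generative world model (here hand-translated, standing in for an LLM),
# and a query is answered by Bayesian inference via enumeration.
import itertools

# Hypothetical translation of "It rained or the sprinkler ran; the grass
# is wet. Did it rain?" into a generative model over two latent causes.
def model(rain, sprinkler):
    """Return (prior probability of this world, whether the grass is wet)."""
    p = (0.3 if rain else 0.7) * (0.5 if sprinkler else 0.5)
    wet = rain or sprinkler
    return p, wet

def query_p_rain_given_wet():
    """P(rain | grass is wet), by enumerating all latent worlds."""
    num = den = 0.0
    for rain, sprinkler in itertools.product([True, False], repeat=2):
        p, wet = model(rain, sprinkler)
        if wet:                      # condition on the observation
            den += p
            if rain:
                num += p
    return num / den

print(round(query_p_rain_given_wet(), 3))  # → 0.462
```

The division of labor matches the framework: broad-coverage language understanding produces the program, while a general-purpose inference engine, rather than the language model, supplies the coherent probabilistic reasoning.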
From START to FINISH: the influence of osmotic stress on the cell cycle